Chapter 2 Bias assessment
The maps below show the geographic gradient of obesity rates in the state of Utah. The left panel shows estimates at the census-tract level, while the right panel shows estimates at the county level.
UT, Diabetes maps
tmap_arrange(UT_Diabetes_tract, UT_Diabetes_county, ncol=2)
NY, Obesity maps
tmap_arrange(NY_Obesity_tract, NY_Obesity_county, ncol=2)
NY, Diabetes maps
tmap_arrange(NY_Diabetes_tract, NY_Diabetes_county, ncol=2)
The following table presents the census-tract-level rates of obesity and diabetes in the state of Utah.
2.1 Bias due to omitted confounders
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \epsilon_i, \quad i = 1, \dots, n\]
where the errors \(\epsilon_i\) are independent and identically distributed (i.i.d.) with \(\epsilon_i \sim N(0, \sigma^2)\).
Let’s assume the following association is true (i.e., the gold standard), with no selection bias, no measurement bias, and no other unmeasured confounding.
N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)
Y <- .3 * C + .4 * X + rnorm(N)
2.1.1 Gold standard
With the correct model specification (i.e., \(C\) as a confounder), we get an unbiased estimate of \(X\) on \(Y\).
# Gold standard
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0002838 0.0031611 -0.09 0.928
## X 0.3946515 0.0031562 125.04 <2e-16 ***
## C 0.3000512 0.0035321 84.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9992331)
##
## Null deviance: 140183 on 99999 degrees of freedom
## Residual deviance: 99920 on 99997 degrees of freedom
## AIC: 283716
##
## Number of Fisher Scoring iterations: 2
2.1.2 Misspecified model: a confounder, \(C\), was omitted from the model
By omitting \(C\), the estimate of the effect of \(X\) is biased, either “away from” or “toward” the null.
# C was omitted
glm.biased <- glm(Y~X, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0009285 0.0032732 -0.284 0.777
## X 0.5140045 0.0029264 175.645 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.071334)
##
## Null deviance: 140183 on 99999 degrees of freedom
## Residual deviance: 107131 on 99998 degrees of freedom
## AIC: 290682
##
## Number of Fisher Scoring iterations: 2
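The size of this discrepancy can be checked against the classic omitted-variable bias formula. Under the simulated coefficients above, \(\mathrm{Var}(X) = 0.5^2 + 1 = 1.25\) and \(\mathrm{Cov}(X, C) = 0.5\), so the probability limit of the slope from the misspecified model is

\[\hat{\beta}_X \xrightarrow{p} \beta_X + \beta_C \, \frac{\mathrm{Cov}(X, C)}{\mathrm{Var}(X)} = 0.4 + 0.3 \times \frac{0.5}{1.25} = 0.52,\]

which closely matches the estimate of about 0.514 above.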
2.1.3 Bias “away from” or “toward” the null?
N <- 100000
C <- rnorm(N)
X <- -.5 * C + rnorm(N)
Y <- -.3 * C + .4 * X + rnorm(N)
# Correctly specified model, with C
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006061 0.003159 -1.919 0.055 .
## X 0.395909 0.003151 125.638 <2e-16 ***
## C -0.303340 0.003536 -85.791 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9977296)
##
## Null deviance: 140646 on 99999 degrees of freedom
## Residual deviance: 99770 on 99997 degrees of freedom
## AIC: 283565
##
## Number of Fisher Scoring iterations: 2
# C was omitted
glm.biased <- glm(Y~X, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005774 0.003273 -1.764 0.0777 .
## X 0.516762 0.002921 176.933 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.071154)
##
## Null deviance: 140646 on 99999 degrees of freedom
## Residual deviance: 107113 on 99998 degrees of freedom
## AIC: 290665
##
## Number of Fisher Scoring iterations: 2
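By the omitted-variable bias formula, the probability limit of the misspecified slope is \(\beta_X + \beta_C \,\mathrm{Cov}(X, C)/\mathrm{Var}(X)\), so the direction of the bias is the sign of \(\beta_C \,\mathrm{Cov}(X, C)\). Here both path coefficients are negative, so their product is positive and the bias is again away from the null:

\[\hat{\beta}_X \xrightarrow{p} 0.4 + (-0.3) \times \frac{-0.5}{1.25} = 0.52,\]

consistent with the estimate of about 0.517 above. Had \(C\) affected \(X\) and \(Y\) in opposite directions, the estimate would instead have been biased toward the null.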
2.1.4 \(C\) is not a confounder of \(X\) and \(Y\)
N <- 100000
C <- rnorm(N)
X <- rnorm(N)
Y <- .4 * X + rnorm(N)
2.1.5 Correct model specification: without \(C\)
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002183 0.003155 -0.692 0.489
## X 0.399391 0.003137 127.306 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9951843)
##
## Null deviance: 115645 on 99999 degrees of freedom
## Residual deviance: 99516 on 99998 degrees of freedom
## AIC: 283309
##
## Number of Fisher Scoring iterations: 2
2.1.6 Misspecified model with \(C\)
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002165 0.003155 -0.686 0.493
## X 0.399398 0.003137 127.308 <2e-16 ***
## C -0.003133 0.003156 -0.993 0.321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9951844)
##
## Null deviance: 115645 on 99999 degrees of freedom
## Residual deviance: 99515 on 99997 degrees of freedom
## AIC: 283310
##
## Number of Fisher Scoring iterations: 2
2.1.7 \(C\) is a collider of \(X\) and \(Y\)
N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)
2.1.8 Correct model specification: without \(C\)
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002391 0.003162 0.756 0.45
## X 0.699815 0.003164 221.198 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9996449)
##
## Null deviance: 148874 on 99999 degrees of freedom
## Residual deviance: 99962 on 99998 degrees of freedom
## AIC: 283756
##
## Number of Fisher Scoring iterations: 2
2.1.9 Misspecified model with \(C\)
This is an example of selection bias. Suppose \(X\) is education, \(Y\) is income, and \(C\) is registration in a social welfare program. People with lower education (the high-risk group with respect to the exposure) and lower income (the high-risk group with respect to the outcome) are more likely to register for the program. If a survey were conducted only among registrants, the association estimated from these disproportionately selected respondents would likely be biased.
glm.biased <- glm(Y~X + C, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.003165 0.002712 1.167 0.2432
## X -0.015171 0.004648 -3.264 0.0011 **
## C 0.441410 0.002329 189.493 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7355337)
##
## Null deviance: 148874 on 99999 degrees of freedom
## Residual deviance: 73551 on 99997 degrees of freedom
## AIC: 253077
##
## Number of Fisher Scoring iterations: 2
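The selection story can also be simulated directly. The sketch below (an illustration, not part of the original analysis; `selected` and `glm.selected` are hypothetical names, and the cutoff \(C > 0\) is an arbitrary stand-in for “registered for the program”) restricts the fitted sample to registrants instead of adjusting for \(C\):

```r
# Selecting the sample on the collider C biases Y ~ X even when
# C does not appear in the model.
set.seed(1)
N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)
selected <- C > 0   # survey only the "registrants"
glm.selected <- glm(Y ~ X, family = "gaussian", subset = selected)
coef(glm.selected)["X"]   # noticeably below the true slope of 0.7
```

Even though \(C\) is not a covariate, conditioning the sample on it opens the collider path and attenuates the estimated slope.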
2.2 Overadjustment bias
Please note that this is not a comprehensive example; it reflects only one aspect of potential overadjustment bias.
Let’s assume a model with \(M\) as a mediator.
N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)
2.3 Total effect
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003217 0.003408 -0.944 0.345
## X 0.498535 0.003399 146.662 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.161171)
##
## Null deviance: 141091 on 99999 degrees of freedom
## Residual deviance: 116115 on 99998 degrees of freedom
## AIC: 298735
##
## Number of Fisher Scoring iterations: 2
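The slope of about 0.499 matches what path tracing predicts for the total effect under the simulated coefficients:

```r
# Path-tracing check of the simulated total effect of X on Y
direct   <- .3        # X -> Y
indirect <- .5 * .4   # X -> M -> Y
direct + indirect     # total effect: 0.5
```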
2.4 Overadjustment
glm.overadjusted <- glm(Y~X + M, family="gaussian")
summary(glm.overadjusted)
##
## Call:
## glm(formula = Y ~ X + M, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003242 0.003165 -1.024 0.306
## X 0.300231 0.003528 85.101 <2e-16 ***
## M 0.398702 0.003163 126.039 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.002003)
##
## Null deviance: 141091 on 99999 degrees of freedom
## Residual deviance: 100197 on 99997 degrees of freedom
## AIC: 283993
##
## Number of Fisher Scoring iterations: 2
2.5 Logistic models
2.5.1 Sex as a Confounder, \(C\)
MYY <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 1,
freq = 5
)
MYN <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 0,
freq = 8
)
MNY <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 1,
freq = 45
)
MNN <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 0,
freq = 72
)
FYY <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 1,
freq = 25
)
FYN <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 0,
freq = 10
)
FNY <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 1,
freq = 25
)
FNN <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 0,
freq = 10
)
Ex_confounder <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)
Convert the frequency table to raw data:
library(tidyr)
raw_confounder <- Ex_confounder %>%
uncount(freq)
# Crude model (Sex omitted)
glm.crude <- glm(Cancer ~ Smoking, family=binomial(link = "logit"), data=raw_confounder)
summary(glm.crude)
##
## Call:
## glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1582 0.1627 -0.972 0.3309
## SmokingYes 0.6690 0.3397 1.970 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 277.26 on 199 degrees of freedom
## Residual deviance: 273.28 on 198 degrees of freedom
## AIC: 277.28
##
## Number of Fisher Scoring iterations: 4
- Full model:
glm_logit <- glm(Cancer ~ Smoking + Sex , family=binomial(link = "logit"), data=raw_confounder)
glm_logit
##
## Call: glm(formula = Cancer ~ Smoking + Sex, family = binomial(link = "logit"),
## data = raw_confounder)
##
## Coefficients:
## (Intercept) SmokingYes SexMale
## 9.163e-01 4.266e-15 -1.386e+00
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 277.3
## Residual Deviance: 257 AIC: 263
- Stratified models
# For males
raw_confounder_M <- raw_confounder[ which(raw_confounder$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_M)
glm_logit_m
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder_M)
##
## Coefficients:
## (Intercept) SmokingYes
## -4.700e-01 6.672e-16
##
## Degrees of Freedom: 129 Total (i.e. Null); 128 Residual
## Null Deviance: 173.2
## Residual Deviance: 173.2 AIC: 177.2
# For females
raw_confounder_F <- raw_confounder[ which(raw_confounder$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_F)
glm_logit_f
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder_F)
##
## Coefficients:
## (Intercept) SmokingYes
## 9.163e-01 9.400e-16
##
## Degrees of Freedom: 69 Total (i.e. Null); 68 Residual
## Null Deviance: 83.76
## Residual Deviance: 83.76 AIC: 87.76
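On the odds-ratio scale, the pattern above can be reproduced with a self-contained sketch (same cell counts, rebuilt with base R `rep()` instead of `uncount()`; the object names `tab`, `raw`, `crude`, and `adjusted` are new to this sketch): the crude OR for smoking is about 1.95, while the Sex-adjusted OR is 1 — the crude association is produced entirely by Sex.

```r
# Rebuild the confounder table compactly (same counts as above)
tab <- data.frame(
  Sex     = rep(c("Male", "Female"), each = 4),
  Smoking = rep(c("Yes", "Yes", "No", "No"), 2),
  Cancer  = rep(c(1, 0), 4),
  freq    = c(5, 8, 45, 72, 25, 10, 25, 10)
)
raw <- tab[rep(seq_len(nrow(tab)), tab$freq), ]   # one row per person
crude    <- glm(Cancer ~ Smoking, family = binomial, data = raw)
adjusted <- glm(Cancer ~ Smoking + Sex, family = binomial, data = raw)
exp(coef(crude)["SmokingYes"])      # ~1.95: crude OR
exp(coef(adjusted)["SmokingYes"])   # ~1.00: Sex-adjusted OR
```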
2.5.2 Sex as a Moderator, \(M\)
MYY <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 1,
freq = 5
)
MYN <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 0,
freq = 4
)
MNY <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 1,
freq = 45
)
MNN <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 0,
freq = 68
)
FYY <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 1,
freq = 25
)
FYN <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 0,
freq = 14
)
FNY <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 1,
freq = 25
)
FNN <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 0,
freq = 14
)
Ex_moderator <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)
Convert the frequency table to raw data:
library(tidyr)
raw_moderator <- Ex_moderator %>%
uncount(freq)
- Pooled model (Sex ignored):
glm_logit <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator)
glm_logit
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator)
##
## Coefficients:
## (Intercept) SmokingYes
## -0.1582 0.6690
##
## Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
## Null Deviance: 277.3
## Residual Deviance: 273.3 AIC: 277.3
- Stratified models
# For males
raw_moderator_M <- raw_moderator[ which(raw_moderator$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_M)
glm_logit_m
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator_M)
##
## Coefficients:
## (Intercept) SmokingYes
## -0.4128 0.6360
##
## Degrees of Freedom: 121 Total (i.e. Null); 120 Residual
## Null Deviance: 165.1
## Residual Deviance: 164.3 AIC: 168.3
# For females
raw_moderator_F <- raw_moderator[ which(raw_moderator$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_F)
glm_logit_f
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator_F)
##
## Coefficients:
## (Intercept) SmokingYes
## 5.798e-01 -2.621e-16
##
## Degrees of Freedom: 77 Total (i.e. Null); 76 Residual
## Null Deviance: 101.8
## Residual Deviance: 101.8 AIC: 105.8
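The stratified results can also be summarized in a single interaction model (a self-contained sketch with the same cell counts; `tab`, `raw`, and `fit` are names introduced here): smoking is associated with cancer among males (OR ≈ 1.89) but not among females (OR = 1), so a single pooled OR misrepresents both strata.

```r
# Rebuild the moderator table compactly (same counts as above)
tab <- data.frame(
  Sex     = rep(c("Male", "Female"), each = 4),
  Smoking = rep(c("Yes", "Yes", "No", "No"), 2),
  Cancer  = rep(c(1, 0), 4),
  freq    = c(5, 4, 45, 68, 25, 14, 25, 14)
)
raw <- tab[rep(seq_len(nrow(tab)), tab$freq), ]   # one row per person
fit <- glm(Cancer ~ Smoking * Sex, family = binomial, data = raw)
exp(coef(fit)["SmokingYes"])           # OR for smoking among females: 1
exp(coef(fit)["SmokingYes:SexMale"])   # ~1.89: ratio of male OR to female OR
```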